Abstract:
This paper investigates the convergence of discrete Hopfield neural networks with delay and presents several results on their convergence. Several new sufficient conditions under which networks with delay converge to a limit cycle of length at most 2 are obtained, and conditions under which they converge to a limit cycle of length exactly 2 are also investigated. The results established here extend previous convergence results for discrete Hopfield neural networks, both with and without delay, under the parallel updating mode.
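For reference, the class of networks studied in this line of work is commonly written with the following synchronous update rule (a minimal sketch; the weight matrices $W^{0}$, $W^{1}$ and thresholds $\theta_i$ are generic notation, not necessarily the paper's):
\[
x_i(t+1) = \operatorname{sgn}\!\Big(\sum_{j=1}^{n} w^{0}_{ij}\, x_j(t) + \sum_{j=1}^{n} w^{1}_{ij}\, x_j(t-1) - \theta_i\Big), \qquad x_i(t) \in \{-1, +1\}.
\]
Under parallel updating, convergence to a limit cycle of length at most 2 means $x(t+2) = x(t)$ for all sufficiently large $t$; a limit cycle of length exactly 2 additionally requires $x(t+1) \neq x(t)$.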
Abstract:
With the fast increasing complexity of integrated circuits, verification has become the bottleneck of today's IC design flow. In fact, over 70% of the IC design turn-around time can be spent on verification in a typical IC design project. Among the various verification tasks, Register Transfer Level (RTL) simulation is the most widely used method to validate the correctness of digital IC designs. When simulating a large IC design with complicated internal behaviors (e.g., CPU cores running embedded software), RTL simulation can be extremely time consuming. Since RTL-to-layout is still the most prevalent IC design methodology, it is essential to speed up the RTL simulation process. Recently, General Purpose computing on Graphics Processing Units (GPGPU) has become a promising paradigm for accelerating computation-intensive workloads, and a few recent works have demonstrated the effectiveness of using GPUs to expedite gate- and system-level simulation tasks. In this work, we propose an efficient GPU-accelerated RTL simulation framework. We introduce a methodology to translate a Verilog RTL description into equivalent GPU source code so as to simulate circuit behavior on GPUs. In addition, a CMB (Chandy-Misra-Bryant) based parallel simulation protocol is adopted to provide a sufficient level of parallelism. Because RTL simulation lacks data-level parallelism, we also present a novel solution that uses the GPU as an efficient task-level parallel processor. Experimental results show that our GPU-based simulator outperforms a commercial sequential RTL simulator by more than 20-fold.
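The paper's framework itself is not reproduced here; the following is only a minimal Python sketch of the conservative CMB idea it builds on: each logical process advances its local clock only when every input channel guarantees, via a timestamp (a real event or a null message), that nothing earlier can still arrive. All names (Channel, LogicalProcess, the lookahead value) are illustrative assumptions.

import heapq
import itertools
from collections import namedtuple

_seq = itertools.count()  # tie-breaker so heapq never compares payloads

Event = namedtuple("Event", ["time", "payload"])  # payload=None -> null message

class Channel:
    def __init__(self):
        self.queue = []   # pending events in timestamp order
        self.clock = 0    # lower bound on any future timestamp on this channel

    def send(self, event):
        heapq.heappush(self.queue, (event.time, next(_seq), event))
        self.clock = max(self.clock, event.time)

class LogicalProcess:
    def __init__(self, name, lookahead, inputs, outputs):
        self.name, self.lookahead = name, lookahead
        self.inputs, self.outputs = inputs, outputs
        self.local_time = 0

    def safe_time(self):
        # Safe to process anything up to the minimum clock over all input channels.
        return min(ch.clock for ch in self.inputs) if self.inputs else float("inf")

    def step(self):
        horizon = self.safe_time()
        for ch in self.inputs:
            while ch.queue and ch.queue[0][0] <= horizon:
                _, _, ev = heapq.heappop(ch.queue)
                self.local_time = max(self.local_time, ev.time)
                if ev.payload is not None:
                    print(f"{self.name} @ {ev.time}: {ev.payload}")
        # Propagate a null message so downstream LPs can advance (deadlock avoidance).
        for ch in self.outputs:
            ch.send(Event(self.local_time + self.lookahead, None))

# Wiring two LPs in a ring and driving one input event:
a_to_b, b_to_a = Channel(), Channel()
lp_a = LogicalProcess("A", lookahead=1, inputs=[b_to_a], outputs=[a_to_b])
lp_b = LogicalProcess("B", lookahead=1, inputs=[a_to_b], outputs=[b_to_a])
b_to_a.send(Event(0, "reset"))
for _ in range(3):
    lp_a.step(); lp_b.step()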
Abstract:
Multiple supply voltage (MSV) is an effective method to optimize chip power consumption. In MSV design, voltage island generation is a crucial concern: blocks with the same voltage level are clustered into one or more voltage islands to reduce the cost of the power supply network and level converters. The distribution of voltage islands depends not only on the feasible voltage assignment derived from timing analysis, but also on the physical adjacency between blocks. In traditional post-floorplan voltage island generation approaches, the fixed layout of blocks greatly limits power optimization. Instead of restarting the search process to generate better solutions, which suffers from long run time and poor scalability, this paper proposes an incremental power optimization methodology that further reduces the power consumption of traditional post-floorplan MSV designs without compromising circuit performance. A net-flow based timing slack distribution algorithm is proposed to obtain maximal power reduction, considering both the characteristics of each block and the circuit topology. The floorplan is then incrementally modified to reconstruct the voltage island distribution based on the physical constraint graph, while chip area, total wire length, and performance are considered simultaneously. Experimental results show that our methodology reduces power consumption while maintaining chip area, total wire length, and power supply network cost.
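The abstract does not spell out the net-flow formulation; the sketch below only illustrates the general idea of distributing timing slack with a max-flow solver (here networkx): source-to-path arcs carry the slack available on each timing path, and block-to-sink arcs bound the extra delay (and hence voltage reduction) each block can absorb. All node names and capacities are made-up illustrative values, not the paper's algorithm.

import networkx as nx

# Illustrative only: distribute timing slack from paths to the blocks on them
# via max-flow.  Flow reaching a block is the extra delay that block may take
# on, e.g. by moving to a lower supply voltage.
G = nx.DiGraph()
G.add_edge("src", "path1", capacity=4)      # path1 has 4 units of slack
G.add_edge("src", "path2", capacity=2)      # path2 has 2 units of slack
G.add_edge("path1", "blockA", capacity=4)   # blockA lies on path1
G.add_edge("path1", "blockB", capacity=4)   # blockB lies on path1 and path2
G.add_edge("path2", "blockB", capacity=2)
G.add_edge("blockA", "sink", capacity=3)    # delay increase blockA can absorb
G.add_edge("blockB", "sink", capacity=5)    # delay increase blockB can absorb

flow_value, flow = nx.maximum_flow(G, "src", "sink")
print("total slack distributed:", flow_value)
for block in ("blockA", "blockB"):
    print(block, "absorbs", flow[block]["sink"], "units of slack")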
Abstract:
A binary-weight hourglass network (B-HG) accelerator for landmark detection, built on the proposed look-up-table (LUT) based multi-level prediction-correction approach, enables high-speed and energy-efficient processing on IoT edge devices. First, a LUT with a unified mode is adopted to support convolutional neural networks with fully variable weight bit precision and minimize the operations of B-HG, achieving a $1.33\times$-$1.50\times$ speedup on multi-bit-weight CNNs relative to a similar solution. Second, a multi-level prediction-correction model is proposed to achieve computation-efficient convolution with adaptive precision, saving about 30% more operations than a two-stage model. Combining these two methods saves nearly 77.4% of the operations in B-HG, yielding a 2.3× inference speedup. Third, a block-computing based pipeline is designed to address the residual-block inefficiency in B-HG; it reduces off-chip memory access by about 66.2% compared with the baseline, and saves 60% of on-chip memory space and 31% of on-chip memory access compared with a similar fused-layer accelerator. Based on simulation in a TSMC 28 nm process, the proposed B-HG accelerator achieves 450 fps at 500 MHz with a power efficiency of up to 8.5 TOPS/W, two orders of magnitude higher than a dedicated face landmark detection accelerator.
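The abstract only names the multi-level prediction-correction idea; the NumPy sketch below shows one plausible reading of it, not the paper's exact scheme: a cheap pass over the high-order activation bits predicts which pre-activations a ReLU would zero out anyway, and only the remaining outputs are corrected with the low-order bits. The function name, bit split, and margin are illustrative assumptions.

import numpy as np

# Illustrative two-level prediction-correction for a binary-weight dot product.
# Weights are in {-1, +1}, so "multiplications" degenerate to signed additions.
def predict_correct(acts, weights, msb_bits=4, total_bits=8, margin=8):
    high = (acts >> (total_bits - msb_bits)) << (total_bits - msb_bits)
    low = acts - high
    pred = int(np.dot(high, weights))          # cheap prediction from MSBs
    if pred < -margin:                         # confidently negative -> ReLU zeroes it
        return 0, True                         # skip the exact computation
    exact = pred + int(np.dot(low, weights))   # correction with remaining bits
    return max(exact, 0), False                # ReLU

rng = np.random.default_rng(0)
acts = rng.integers(0, 256, size=64)           # 8-bit activations
weights = rng.choice([-1, 1], size=64)         # binary weights
out, skipped = predict_correct(acts, weights)
print(out, "skipped" if skipped else "corrected")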
Abstract:
Face detection and alignment are highly correlated, computation-intensive tasks, yet no facial-oriented accelerator has flexibly supported both. This work proposes the first unified accelerator for multi-face detection and alignment, along with optimizations on the multi-task cascaded convolutional networks algorithm, to implement both tasks. First, clustering non-maximum suppression is proposed to significantly reduce intersection-over-union computation and eliminate the hardware-interfering sorting process, bringing a 16.0% speed-up without any loss. Second, a new pipeline architecture is presented to implement the proposal network in a more computation-efficient manner, with 41.7% less multiplier usage and a 38.3% decrease in memory capacity compared with a similar method. Third, a batch scheduling mechanism is proposed to improve the hardware utilization of the fully-connected layers by 16.7% on average under variable input numbers in batch processing. Based on a TSMC 28 nm CMOS process, this accelerator takes only 6.7 ms at 400 MHz to simultaneously process 5 faces per image and achieves 1.17 TOPS/W power efficiency, $54.8\times$ higher than the state-of-the-art solution.
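The paper's clustering NMS is not detailed in the abstract; the sketch below captures the spirit only: candidate boxes are greedily grouped into clusters by IoU without a global score sort, and each cluster keeps its best-scoring member. The iou helper, the 0.5 threshold, and the arrival-order processing are illustrative assumptions.

import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def clustering_nms(boxes, scores, thr=0.5):
    # Greedy clustering in arrival order (no global sort): a box joins the first
    # cluster whose representative it overlaps; otherwise it seeds a new cluster.
    clusters = []                     # list of (best_box, best_score)
    for box, score in zip(boxes, scores):
        for i, (rep, rep_score) in enumerate(clusters):
            if iou(box, rep) > thr:
                if score > rep_score:
                    clusters[i] = (box, score)
                break
        else:
            clusters.append((box, score))
    return [rep for rep, _ in clusters]

boxes = [(10, 10, 50, 50), (12, 11, 52, 49), (100, 100, 140, 150)]
scores = [0.7, 0.9, 0.8]
print(clustering_nms(boxes, scores))  # keeps one box per face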
Abstract:
While deep convolutional neural networks (CNNs) have emerged as the driving force of a wide range of domains, their computationally and memory intensive nature hinders further deployment in mobile and embedded applications. Recently, CNNs with low-precision parameters have attracted much research attention. Among them, multiplier-free binary- and ternary-weight CNNs are reported to achieve recognition accuracy comparable to full-precision networks and have been employed to improve hardware efficiency. However, even with the weights constrained to binary and ternary values, large-scale CNNs still require billions of operations in a single forward propagation pass. In this paper, we introduce a novel approach to maximally eliminate redundancy in binary- and ternary-weight CNN inference, improving both performance and energy efficiency. The initial kernels are transformed into fewer and sparser ones, and the output feature maps are rebuilt from the intermediate results, reducing the total number of operations in convolution. To find an efficient transformation for each already trained network, we propose a search algorithm that iteratively matches and eliminates the overlap within a set of kernels. We design a dedicated hardware architecture to optimize the implementation of kernel transformation, with a specialized dataflow and scheduling method. Tested on SVHN, AlexNet, and VGG-16, our architecture removes 43.4%-79.9% of the operations and speeds up inference by 1.48-3.01 times.
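The kernel transformation is described only at a high level; the toy sketch below shows the underlying arithmetic for binary weights: when two kernels share a common sub-kernel, the shared partial sum over an input patch is computed once and reused, so the total number of additions drops. The pair selection and data layout here are illustrative, not the paper's search algorithm.

import numpy as np

def overlap(k1, k2):
    # Positions where both kernels carry the same signed weight.
    return np.where(k1 == k2, k1, 0)

rng = np.random.default_rng(1)
k1 = rng.choice([-1, 1], size=(3, 3))       # two binary kernels
k2 = rng.choice([-1, 1], size=(3, 3))
patch = rng.integers(0, 256, size=(3, 3))   # one input patch

shared = overlap(k1, k2)            # common sub-kernel
r1, r2 = k1 - shared, k2 - shared   # sparse residual kernels

shared_sum = np.sum(shared * patch)           # computed once, reused twice
out1 = shared_sum + np.sum(r1 * patch)        # rebuild first output
out2 = shared_sum + np.sum(r2 * patch)        # rebuild second output

assert out1 == np.sum(k1 * patch) and out2 == np.sum(k2 * patch)
# Additions drop roughly in proportion to the non-zeros eliminated:
print("non-zeros:", np.count_nonzero(shared), np.count_nonzero(r1), np.count_nonzero(r2))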
Abstract:
Deep convolutional neural networks (DCNNs) have been widely used in various AI applications. Inception and Residual are two promising structures adopted in many important modern DCNN models, including AlphaGo Zero's model. These structures allow the depth and width of the network to be increased considerably to improve accuracy, without increasing the computational budget or the difficulty of convergence. Various DCNN accelerators have been proposed on FPGA platforms because of their high performance, good power efficiency, and fast development cycles. However, previous FPGA mapping methods cannot fully adapt to the different data localities among layers and other characteristics of Inception and Residual, which leads to under-utilization of FPGA resources. We propose LCP, a Layer Clusters Paralleling mapping method that classifies layers into clusters based on their differences in parameters and data locality, and then accelerates them in different partitions of the FPGA. We evaluate our mapping method by implementing Inception/Residual modules from GoogLeNet [8] and ResNet-50 [4] on a Xilinx VC709 (Virtex 690T) FPGA. The results show that the proposed method fully utilizes resources and achieves up to 4.03× the performance of the baseline and 2.00× the performance of state-of-the-art methods.
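The LCP method itself is only outlined in the abstract; the snippet below merely illustrates the general notion of clustering layers by arithmetic intensity (a rough proxy for data locality) before giving each cluster its own FPGA partition sized by its share of work. The layer statistics, the two-cluster split, and the proportional resource allocation are illustrative assumptions, not the paper's classification.

# Illustrative layer clustering by arithmetic intensity (ops per byte moved);
# each cluster would then get its own FPGA partition.  Numbers are made up.
layers = {
    "conv1":   {"ops": 2.0e8, "bytes": 1.0e6},
    "conv3x3": {"ops": 3.5e8, "bytes": 1.2e6},
    "conv1x1": {"ops": 0.6e8, "bytes": 2.5e6},
    "fc":      {"ops": 0.2e8, "bytes": 4.0e6},
}

def intensity(stats):
    return stats["ops"] / stats["bytes"]

# Two clusters split by a simple intensity threshold (stand-in for the
# parameter/data-locality based classification described in the paper).
threshold = 100.0
clusters = {"compute_bound": [], "memory_bound": []}
for name, stats in layers.items():
    key = "compute_bound" if intensity(stats) >= threshold else "memory_bound"
    clusters[key].append(name)

total_ops = sum(s["ops"] for s in layers.values())
for key, members in clusters.items():
    share = sum(layers[m]["ops"] for m in members) / total_ops
    print(f"{key}: {members}, ~{share:.0%} of the resource budget")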